Fine-tuning an LLM on your texts: part 3 – curating the dataset

This is the third installment of my guide to creating a simulation of yourself based on your text messages, as I did for myself. In this part, we create and upload the dataset — the final bit of data wrangling before we get to the juicy part of training. The previous installment is here.

Clumping texts into Documents

I started with the Jupyter notebook from the previous post. I then created a Document class representing a chunk of text messages exchanged with a single person. This is the natural-language text we will use to train our model.

class Document:
    """A chunk of consecutive messages with one person, rendered as training text."""

    def __init__(self, name, messages):
        self.name = name          # name of the person I was texting with
        self.messages = messages  # list of messages, each with .sender and .text

    def to_text(self):
        # A short system prompt, then an instruction naming the participants,
        # then the messages themselves, separated by ### markers
        result = "<<SYS>>Write a realistic text message chat. Avoid repetition.<</SYS>>\n"
        result += f"[INST]Write a chat between Edward and {self.name}[/INST]\n"
        result += " ".join(f"### {message.sender}: {message.text}" for message in self.messages)
        return result

    def token_len(self):
        # Number of tokens when encoded with the tokenizer loaded earlier in the notebook
        return len(tokenizer.encode(self.to_text()))

You’ll see that the to_text() method builds a string that starts with the prompt, followed by each message in turn. I tried many, many variations and made these discoveries:

  1. Shorter prompts worked well; no need to provide detailed instructions. In my earlier versions, I tried using the prompt to sternly command the LLM to not repeat itself, but it made little difference. It was much more effective to apply constraints during text generation, which we will get to later.
  2. In theory, the <<SYS>> system prompt is meant to sit inside the [INST] block. I found it worked better having one after the other. Your Mileage May Vary.
  3. I experimented a ton with the use of ### and : to separate the messages, and whether to put each message on a separate line. The approach above seemed to work best for me; see the illustrative example after this list. I’m quite sure there’s room to improve the results by refining this structure. If you’ve found something better – please let me know!
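
To make the structure concrete, here is an illustrative rendered document. The name and messages below are made-up placeholders, not real texts from my dataset:

<<SYS>>Write a realistic text message chat. Avoid repetition.<</SYS>>
[INST]Write a chat between Edward and Alice[/INST]
### Edward: Are we still on for tonight? ### Alice: Yes! See you at 7 ### Edward: Great, see you then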

Next I wrote a finicky piece of code to group chats into Documents, where each Document fits within 200 tokens and crams in as many texts with the same person as possible. This will be perfect for our dataset.

MAX_LENGTH = 200  # maximum number of tokens per Document (see the note below on why 200)

documents = []
for name, message_list in tqdm.tqdm(chats.items()):
    pointer = 0
    while pointer < len(message_list):
        # Greedily grow the document one message at a time until adding another
        # message would push it to MAX_LENGTH tokens or beyond
        size = 1
        while pointer + size < len(message_list):
            next_doc = Document(name, message_list[pointer:pointer+size+1])
            if next_doc.token_len() >= MAX_LENGTH:
                break
            size += 1
        document = Document(name, message_list[pointer:pointer+size])
        documents.append(document)
        pointer += size
print(f"{len(documents):,} documents")

# I get 26,104 documents

Investigating the Documents

We now dig into these documents:

data = [doc.to_text() for doc in documents]
lengths = [doc.token_len() for doc in documents]
counts = sum(d.count('###') for d in data)  # every message is prefixed with ###, so this counts messages

print(f'There are {counts:,} messages; average {counts/len(documents):.2} messages in each of {len(documents):,} documents')

# I get: There are 240,985 messages; average 9.2 messages in each of 26,104 documents

Side note: you may be wondering why I chose 200 as the max sequence length. The answer is deeply unsatisfactory: it was trial and error, and 200 worked well. Larger sizes slowed down training and caused memory issues; smaller sizes reduced the quality of the results. Perhaps 9 messages is the typical length of my text conversations. If you have more substantive text chats than I do, you may do better with longer sequences.

Another quick side note: you may notice that a few documents have more than 200 tokens; this happens when the prompt and the first message alone already exceed 200. In that case, the document will be truncated to 200 tokens during training.
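
As a preview of what that looks like (the training setup comes in the next part), here is a minimal sketch of truncation at tokenization time, assuming the Hugging Face tokenizer loaded earlier in the notebook:

# Illustrative only: one common way to cap an over-long document at MAX_LENGTH tokens
tokens = tokenizer(documents[0].to_text(), truncation=True, max_length=MAX_LENGTH)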

We can visualize the size of each Document in the dataset, which helps show that our texts are being packed pretty nicely into 200-token chunks.

import matplotlib.ticker
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
ax.set_xlabel('Number of tokens in a document')
ax.set_ylabel('Count of documents')
# Show thousands separators on the y-axis
ax.get_yaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda y, p: format(int(y), ',')))
# Clip the handful of over-length documents so they don't stretch the x-axis
l2 = [min(MAX_LENGTH+100, l) for l in lengths]
_ = ax.hist(l2, bins=range(0, MAX_LENGTH+50, 10), color='darkorange', rwidth=0.5)

Token size of the documents in our dataset

You should also take a moment to look at some of the documents to convince yourself that everything is working well. And, while you’re there, take a trip down memory lane by surfacing random texts from your past.
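
Something like this works for spot-checking (a quick sketch; the sample size is arbitrary):

import random

for doc in random.sample(documents, 3):
    print(doc.to_text())
    print('---')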

Mixing and Encrypting

I shuffled up the documents for training. Obviously within each document, the messages remain in the right order.

random.seed(42) # for reproducibility
random.shuffle(data)

I then encrypted all the text data. This shouldn’t strictly be necessary, since the dataset will be stored privately on the Hugging Face hub, but I wanted to be extra cautious about keeping my text message history safe. Note down the key; we will use it in Google Colab after downloading the dataset.

from cryptography.fernet import Fernet

key = Fernet.generate_key()
print(key)  # note this down somewhere safe
f = Fernet(key)
data = [f.encrypt(d.encode('utf-8')) for d in data]
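
For reference, decryption is the mirror image; a minimal sketch of what we will do later in Colab with the same key:

# Recover the plain text from the encrypted strings using the saved key
decrypted = [Fernet(key).decrypt(d).decode('utf-8') for d in data]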

Crafting the dataset and uploading

Next, split the data into training and test datasets, holding back 5% of the data for evaluation.

split = int(0.95 * len(data))
train, test = data[:split], data[split:]

Then convert this into the DatasetDict format for Hugging Face:

from datasets import Dataset, DatasetDict
train_dataset = Dataset.from_dict({'text': train})
test_dataset = Dataset.from_dict({'text': test})
dataset = DatasetDict({'train': train_dataset, 'test': test_dataset})
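
It’s worth a quick sanity check before uploading; printing the DatasetDict shows the splits and row counts:

print(dataset)
# Shows a 'train' and a 'test' split, each with a single 'text' column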

Log in to Hugging Face in your notebook using the instructions in Part 1, ensuring that your token gives you permission to write.
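
If you need a reminder, a minimal sketch of the login step (assuming you are working in a notebook) is:

from huggingface_hub import notebook_login

notebook_login()  # paste a token with write access when prompted

Then push the dataset to the hub: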

dataset.push_to_hub(DATA_NAME, private=True)

You should now admire your work by logging in to Hugging Face and clicking on your profile page. You will find your spanking new dataset waiting for you. In the next installment, we will go back to Google Colab, access this dataset, and get to fine-tuning. The next post is here.
